-
Notifications
You must be signed in to change notification settings - Fork 22
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[python] New embeddings API #1023
Conversation
The Census version tag, e.g., ``"2023-12-15"``. | ||
|
||
Returns: | ||
A list of dictionaries, each containing metadata describing an available embedding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Options:
- Return a subset of the metadata that only has relevant information (name, organism, etc). The example listed here is only for reference
- Return the full metadata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strongly prefer 1) and having a verbose
argument
api/python/cellxgene_census/src/cellxgene_census/_get_anndata.py
Outdated
Show resolved
Hide resolved
@@ -18,3 +19,7 @@ def _uri_join(base: str, url: str) -> str: | |||
p_url.fragment, | |||
] | |||
return urllib.parse.urlunparse(parts) | |||
|
|||
def _extract_census_version(census: soma.Collection): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I created a live corpus unit test for this method. This should ensure that this parsing method remains consistent across releases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this code has some lint - please run (an up to date) pre-commit across it
for emb in add_obs_embeddings: | ||
emb_metadata = get_embedding_metadata_by_name(emb, organism, census_version, "obs_embedding") | ||
uri = f"{CENSUS_EMBEDDINGS_LOCATION_BASE_URI}/{census_version}/{emb_metadata['id']}" | ||
embedding = get_embedding(census_version, uri, obs_soma_joinids) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this will cause the census object to be re-opened. While this shouldn't be an issue, it will result into an extra call. With some effort I can refactor get_embedding
to also accept an existing Census object, but I'm not sure if it's worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO, you should refactor the code to have a (common, shared) function that accepts an already open Census handle
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real concerns with functionality, but suggest you do an API review with Pablo and Mike to figure out if the signatures are friendly/comprehensible from a UX perspective
Sorry, my clumsy fingers clicked the wrong button! |
|
||
if add_obs_embeddings: | ||
if obsm_layers and [x for x in add_obs_embeddings if x in obsm_layers]: | ||
raise ValueError( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the (high) cost of calling query.to_anndata(), you should do all error checking before the costly ops - ie., move this kind of stuff to a prologue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point.
obs_soma_joinids = query.obs_joinids() | ||
for emb in add_obs_embeddings: | ||
emb_metadata = get_embedding_metadata_by_name(emb, experiment_name, census_version, "obs_embedding") | ||
uri = f"{CENSUS_EMBEDDINGS_LOCATION_BASE_URI}/{census_version}/{emb_metadata['id']}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't these use urljoin()?
census_directory = get_census_version_directory() | ||
|
||
if add_obs_embeddings: | ||
if obsm_layers and [x for x in add_obs_embeddings if x in obsm_layers]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this error check (name collision) is missing from the var axis. Seems like you need it for both
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed varm_layers
from the arguments, as you suggested, since it is not in the current API. The only way you can request a varm is through add_varm_embeddings
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right - gotcha!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, I'm thinking about this previous conversation a bit more. Coming around to a slightly different perspective. Given:
- tiledbsoma/somacore support more functionality than is exposed here (e.g., varp_layers, etc)
- Census doesn't currently use these, due to schema definition (not code)
- we want to clearly separate the arg names for clarity (the previous conversation where we decided to remove
varm_layers
)
I think we should land roughly here, to keep the code & schema modular:
- this function (
get_anndata
) should pass through all (or at least most) arguments supported by ExperimentAxisQuery.to_anndata. That makes it future-proof and decoupled from the schema - the newly added args should have clearly separated names (the current names are not clear per above comment)
- the error checking needs to detect collisions because AnnData only has one "dict" to shove them all in, and they might conflict.
Boiling this down, I suggest:
- Add (and pass to query.to_anndata) the args
obsm_layers
,varm_layers
,obsp_layers
, andvarp_layers
. These are just pass through. - Rename
add_obs_embeddings
andadd_var_embeddings
to something clear (see above discussion) - Do the error checks for collisions, and do them before you do any data loading. Example below.
Example error check for obs:
if set(obsm_layers) & set(add_obs_embeddings):
... there is a collision error ...
....do same for varm ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good.
api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py
Show resolved
Hide resolved
response = requests.get(CELL_CENSUS_EMBEDDINGS_MANIFEST_URL) | ||
response.raise_for_status() | ||
|
||
versions = set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this whole blob could be a simple comprehension which might make it more pythonic (this is a nit, up to you). E.g,
return sorted({ obj['census_version'] for obj in manifest.values() if ... })
And unless there are duplicates expected, I'm not sure what the set adds? If there are duplicates, doesn't that imply you need more filter criteria?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The set is because multiple embeddings can exist for a single alias, and in this case we're only interested in the census version string, so it needs to be deduplicated. I'll rewrite using the comprehension.
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1023 +/- ##
==========================================
+ Coverage 81.33% 82.32% +0.99%
==========================================
Files 73 74 +1
Lines 5566 5714 +148
==========================================
+ Hits 4527 4704 +177
+ Misses 1039 1010 -29
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebezzi have you considered the convenience API to get all available Census version for a given embedding name?
The Census version tag, e.g., ``"2023-12-15"``. | ||
|
||
Returns: | ||
A list of dictionaries, each containing metadata describing an available embedding. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
strongly prefer 1) and having a verbose
argument
api/python/cellxgene_census/src/cellxgene_census/experimental/_embedding.py
Show resolved
Hide resolved
@pablo-gar |
New embeddings API. Provides a unified access pattern to embeddings (regardless of whether they're collaboration or hosted) through
get_anndata
.